Optimization for Linear Regression

Today we're going to look at how to find the best linear model.

We've seen a few different ways to frame this in the last lecture, but the most "machine learning-y" framing is:

  1. We fix our model class to be linear models, i.e. we have a functional model where we are trying to predict $y$ from (in the simplest case) a single feature $x$, $$ y = f(x),$$ and we assume that $$ f(x) = w_0 + w_1x $$

  2. We fix our loss function to be the squared error loss function. We've written this as $\mathcal{L}$ previously, but you will also sometimes (often) see the loss function denoted as $J$. In other words, $$ J(w_0,w_1) = \mathcal{L}(f_{w_0, w_1}) = \frac{1}{2}\sum_{i=1}^N(y_i - (w_0 + w_1x_i))^2 $$

(Note: the $\frac{1}{2}$ is added for mathematical convenience. Exercise: if we're trying to find the best values of $w_0$ and $w_1$, why doesn't it matter whether that factor is there or not?)
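To make this concrete, here's a minimal sketch of $J$ as a Python function (the function name `squared_error_loss` and the toy arrays are mine, purely for illustration):

```python
import numpy as np

def squared_error_loss(w0, w1, x, y):
    """J(w0, w1) = 1/2 * sum_i (y_i - (w0 + w1 * x_i))^2"""
    residuals = y - (w0 + w1 * x)
    return 0.5 * np.sum(residuals ** 2)

# Hypothetical toy data, just to show the call:
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
print(squared_error_loss(0.0, 2.0, x, y))  # loss for the line y = 2x
```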

Then, we need to figure out how to find the best hypothesis/model in our hypothesis/model class! We're going to look at three ways to do that today:

  1. Closed form
  2. Gradient Descent
  3. Stochastic Gradient Descent

Setup

Data

Throughout, we'll use a few different datasets (and don't worry, we'll get back to the real-world one!), but the primary one is displayed below:
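(The plot itself isn't reproduced in these notes; as a stand-in, here is one hypothetical way to generate data with a roughly linear trend. The coefficients 1.5 and 0.8 and the noise level are invented, not the course dataset.)

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)                  # 50 feature values
y = 1.5 + 0.8 * x + rng.normal(0, 1.0, size=50)  # a line plus Gaussian noise

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```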

Big picture: What are we trying to do?

Before we jump into how to optimize, it's worth thinking through what we're actually trying to do. We are trying to find the values of $w_0$ and $w_1$ that minimize the loss $J(w_0, w_1)$: that is, the line that fits our data best under squared error.

One important note: this loss function is convex, and so it has a single, unique, global minimum. Here's a non-convex function:

Some of what we're doing here relies on the convexity of the loss function for a linear model with squared error loss; that is, on the fact that the Ordinary Least Squares (OLS) objective is convex.

Optimization approach 1: Closed Form

First, some notes:

Exercise (don't peek): how are the points above related to finding the best possible values of $w_0$ and $w_1$?

How to do it - $w_0$ and $w_1$ only

A reminder of our loss function $$J(w_0,w_1) = \frac{1}{2}\sum_{i=1}^N(y_i - (w_0 + w_1x_i))^2$$

Now, we want to minimize the loss. This is an optimization problem:

$$ \mathop{\mathrm{argmin}}_{w_0, w_1} \frac{1}{2}\sum_{i=1}^N(y_i - (w_0 + w_1x_i))^2 $$

We can try to solve this optimization problem by taking the partial derivative of our loss function with respect to both $w_1$ and $w_0$ and setting each of those equal to zero. We can then solve for the respective values; because our function is convex, these will be the values that reach the global minimum of the loss function.

I'm not going to derive this in class, and you are not responsible for knowing how to derive this for the case of only $w_0$ and $w_1$, since this is a special case. But for those interested, there's a nice, detailed derivation here.

The resulting equations, though (where $\bar{x}$ is the sample mean of $x$ and $\bar{y}$ is the sample mean of $y$), are:

$$w_1 = \frac{ \sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y}) }{ \sum_{i=1}^N (x_i - \bar{x})^2 } $$

Note that this can be derived differently, leading to $\frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}$ ... which is the same thing! (See Shalizi, Section 1.4 if you're curious; you are not responsible for knowing this, though.)

$$ w_0 = \bar{y} - w_1\bar{x} $$
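Translated directly into numpy, a sketch might look like this (the toy data is hypothetical):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form OLS for y = w0 + w1 * x."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data with a roughly linear trend:
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(0, 1, 50)
w0, w1 = fit_simple_ols(x, y)
print(w0, w1)  # should land close to 1.5 and 0.8
```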

Moving to the more general case

In class, we will discuss how if we 1) move $w_0$ into the feature set with a little trick (adding a constant feature that is always $1$), and 2) look at many features instead of one, it becomes cleaner (although equivalent) to write the loss function in vector/matrix notation, as: $$ J({\bf w}) = \frac{1}{2}({\bf y} - {\bf Xw})^\top({\bf y} - {\bf Xw}) $$

If we take the derivative of this loss function and set it equal to zero, we can derive that the loss function is minimized when: $$ {\bf w} = ({\bf X}^\top{\bf X})^{-1}{\bf X}^\top{\bf y} $$

I will show this derivation in class, largely following Dr. Chandola's derivation in his scans.

Now, let's try to implement this in code below!
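Here is one minimal sketch (I use `np.linalg.solve` on the normal equations rather than forming the inverse explicitly, which is the numerically safer habit; the toy data is hypothetical):

```python
import numpy as np

def fit_ols_closed_form(X, y):
    """Solve (X^T X) w = X^T y rather than inverting X^T X explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data; the column of ones is the "little trick" that absorbs w0.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
print(fit_ols_closed_form(X, y))  # approximately [1.5, 0.8]
```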

Closed form is great, but perhaps surprisingly (it's closed form, after all!), it can actually be extremely expensive for large datasets (exercise: why?).

Also, closed-form solutions only exist for some of the optimization problems we care about in this course. As such, it is useful to now introduce two more general approaches to optimizing model parameters that we will see over and over again.

Method 2: Gradient Descent

An intro to gradient descent in general

These notes are drawn from Dr. Chandola's notes on linear regression.

Gradient descent, also known as steepest descent, is an optimization algorithm for finding a local minimum of a function. To find a local minimum, the algorithm repeatedly "steps" in the direction of the negative of the gradient. Gradient ascent is the same as gradient descent, except that it steps in the direction of the positive of the gradient and therefore finds local maxima instead of minima. The algorithm, where $\eta > 0$ is the step size (learning rate), can be outlined as follows:

    1:   Choose initial guess ${\bf w}^{(0)}$
    2:   for k = 0, 1, 2, ... do
    3:       $s_k = -\nabla f({\bf w}^{(k)})$
    4:       ${\bf w}^{(k+1)} = {\bf w}^{(k)} + \eta s_k$
    5:   end for

As a simple example, let's find a local minimum for the function $f(x) = x^3-2x^2+2$
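A minimal sketch (the starting point and step size are my choices; since $f'(x) = 3x^2 - 4x$, the local minimum we expect to find is at $x = 4/3$):

```python
def f(x):
    return x**3 - 2 * x**2 + 2

def grad_f(x):
    return 3 * x**2 - 4 * x   # f'(x)

x = 2.0      # hypothetical starting guess
eta = 0.1    # hypothetical step size
for _ in range(100):
    x = x - eta * grad_f(x)
print(x, f(x))   # x approaches 4/3, the local minimum
```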

Consider another function: $$ f(x,y) = 5\cos(x - 2) + 5\sin(x - 2y) $$ The gradient of this function is: $$ \nabla f = \left[\begin{array}{c}-5\sin(x-2) + 5\cos(x - 2y) \\ -10\cos(x-2y) \end{array}\right] $$
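A quick sketch of gradient descent using this gradient (starting point, step size, and iteration count are arbitrary choices):

```python
import numpy as np

def grad(w):
    x, y = w
    return np.array([
        -5 * np.sin(x - 2) + 5 * np.cos(x - 2 * y),   # df/dx
        -10 * np.cos(x - 2 * y),                      # df/dy
    ])

w = np.array([0.0, 0.0])   # hypothetical starting point
eta = 0.05                 # hypothetical step size
for _ in range(1000):
    w = w - eta * grad(w)
print(w)   # a stationary point; which one depends on where we start
```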

Linear Regression with Gradient Descent

We are optimizing the linear regression loss function $J({\bf w}) = \frac{1}{2}({\bf y} - {\bf Xw})^\top({\bf y} - {\bf Xw})$ from above; its gradient is $\nabla J({\bf w}) = -{\bf X}^\top({\bf y} - {\bf Xw})$, so each gradient descent step nudges ${\bf w}$ in the direction ${\bf X}^\top({\bf y} - {\bf Xw})$.
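Putting the pieces together, a sketch of gradient descent for linear regression might look like this (step size and iteration count are hand-picked for this hypothetical data, not universal defaults):

```python
import numpy as np

def gd_linear_regression(X, y, eta=5e-4, n_iters=20_000):
    """Gradient descent on J(w) = 1/2 (y - Xw)^T (y - Xw)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -X.T @ (y - X @ w)   # gradient of the squared-error loss
        w = w - eta * grad
    return w

# Hypothetical toy data; the ones column absorbs the intercept w0.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.5 + 0.8 * x + rng.normal(0, 1, 50)
print(gd_linear_regression(X, y))   # should match the closed-form solution
```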

Method 3: Stochastic Gradient Descent

The one issue with gradient descent ... it still uses all the data! That is, at every step we have to compute the loss and the gradient of the loss on all the data. This can be particularly problematic when the dataset is large, especially if it resides on different machines (e.g. in a Hadoop cluster).

It would be nice to instead use only a subset of the data, and still be sure that we get the optimal answer. Turns out, we can!

The stochastic gradient descent algorithm allows us to use single data points from our dataset to compute gradients and update our weights in the gradient descent algorithm. It is, in my humble opinion, actually kind of crazy that this works! I will not delve into the mathematics of why it works, although you can see here for a nice explanation. The general idea is that a uniformly sampled point gives an unbiased estimate of the full gradient, and you can prove (with some effort) that following these noisy gradient estimates still converges (with a few caveats).

You are not responsible for understanding that math for this course, but you will, later on, be responsible for understanding how to implement stochastic gradient descent. More on that in future programming assignments :).
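To make the idea concrete anyway, here is a minimal sketch (not the assignment's required implementation): each weight update uses the gradient of the loss on a single randomly chosen example.

```python
import numpy as np

def sgd_linear_regression(X, y, eta=1e-3, n_epochs=200, seed=0):
    """SGD: each update uses the gradient on one randomly chosen example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):        # shuffle every epoch
            grad_i = -(y[i] - X[i] @ w) * X[i]   # single-example gradient
            w = w - eta * grad_i
    return w

# Same hypothetical toy data as before:
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.5 + 0.8 * x + rng.normal(0, 1, 50)
print(sgd_linear_regression(X, y))  # hovers near the closed-form answer;
                                    # a decaying step size would tighten it
```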

Two Miscellaneous Things

Numerical instability

$X^\top X$

If your data has redundancy (repeated or nearly collinear features), the inverse of the $X^\top X$ matrix will be unstable. However, if you use the sklearn LinearRegression implementation, you will still see a valid solution (see below).
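Here is a sketch of that experiment (the data is hypothetical; the repeated column is what makes $X^\top X$ singular):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
X = np.column_stack([np.ones_like(x), x, x])   # repeated column: X^T X is singular
y = 1.5 + 0.8 * x + rng.normal(0, 1, 50)

print(np.linalg.cond(X.T @ X))   # enormous condition number
# np.linalg.inv(X.T @ X) will raise LinAlgError or return garbage values

reg = LinearRegression(fit_intercept=False).fit(X, y)
print(reg.coef_)   # still a valid (though non-unique) least-squares solution
```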

Why?

Looking ahead: The problem with outliers

Impact of outliers

OLS is susceptible to outliers because of the squared term in the loss function. For Bayesian regression, the issue arises because of the squared term in the exponent of the Gaussian pdf. See below for alternate distributions.

Exercise (don't peek!): What would you do?

A solution: use a different model class!

Here, we could use a model called robust regression. The statsmodels package has a robust linear regression model function (rlm). We won't delve into the details; this is just a note foreshadowing that we can address problems by changing the model class.
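For a taste, a minimal sketch with statsmodels (the data and planted outliers are invented; `HuberT` is one of several robust norms `RLM` accepts):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(0, 1, 50)
y[:3] = 40.0                      # plant a few gross outliers

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print(ols_fit.params)  # dragged toward the outliers
print(rlm_fit.params)  # much closer to the true (1.5, 0.8)
```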

Wrapping up

We now know what linear regression is (as a model/hypothesis class), and how to optimize it in three different ways.

There are two things that remain:

  1. How do we evaluate our model?
  2. How do we understand the coefficients in the model?

To address these, we'll return to our real-world example.